Abstract

A specific type of neural network, the Restricted Boltzmann Machine (RBM), is implemented for
classification and feature detection in machine learning. The RBM is characterized by separate layers of
visible and hidden units, which are able to learn efficiently a generative model of the observed data.
We study a "hybrid" version of RBMs, in which hidden units are analog and visible units are binary,
and we show that the thermodynamics of the visible units are equivalent to those of a Hopfield network, in
which the $N$ visible units are the neurons and the $P$ hidden units are the learned patterns.
We apply the method of stochastic stability to derive the thermodynamics of the model, by considering
a formal extension of this technique to the case of multiple sets of stored patterns, which may act as
a benchmark for the study of correlated sets.

Our results imply that simulating the dynamics of a Hopfield network, requiring the update of $N$
neurons and the storage of $N(N-1)/2$ synapses, can be accomplished by a hybrid Boltzmann
Machine, requiring the update of $N + P$ neurons but the storage of only $NP$ synapses. In addition,
the well known glass transition of the Hopfield network has a counterpart in the Boltzmann Machine:
it corresponds to an optimum criterion for selecting the relative sizes of the hidden and visible layers,
resolving the trade-off between flexibility and generality of the model. The low storage phase of
the Hopfield model corresponds to few hidden units and hence an overly constrained RBM, while the
spin-glass phase (too many hidden units) corresponds to an unconstrained RBM prone to overfitting of
the observed data.

1 Introduction 

A common goal in Machine Learning is to design a device able to reproduce a given system, namely to 
estimate the probability distribution of its possible states jTS] . When a satisfactory model of the system is 
not available, and its underlying principles are not known, this goal can be achieved by the observation of 
a large number of samples [TT] . A well studied example is the visual world, the problem of estimating the 
probability of all possible visual stimuli [23]. A fundamental ability for the survival of living organisms is
to predict which stimuli will be encountered and which are more or less likely to occur. For this purpose,
the brain is believed to develop an internal model of the visual world, to estimate the probability and 
respond to the occurrence of various events [S],[Z]- 

Ising-type neural networks have been widely used as generative models of simple systems [16], [3].
Those models update the synaptic weights between neurons according to a specific learning rule, depending
on the neural activity driven by a given set of observations; after learning, the network is able to
generate a sequence of states whose probabilities match those of the observations. Popular examples of
Ising models, characterized by a quadratic energy function and a Boltzmann distribution of states, are
the Hopfield model [2], [19] and Boltzmann Machines (BM) [IT]. Boltzmann Machines have been
designed to capture the complex statistics of arbitrary systems by dividing neurons into two subsets, visible
and hidden units: marginalizing the Boltzmann distribution over the hidden units allows the BM to
reproduce, through the visible units, arbitrarily complex distributions of states, by learning the appropriate
synaptic weights |T7]. State of the art feature detectors and classifiers implement a specific type of BM,
the Restricted Boltzmann Machine (RBM), because of its efficient learning algorithms [S]. The RBM
is characterized by a bipartite topology in which hidden and visible units are coupled, but there is no
interaction within either set of visible or hidden units [TH|.

Figure 1: Left panel: Schematic representation of a Hybrid Boltzmann Machine (HBM) where the hidden
units are analog ($z$, $\tau$ variables) and the visible units are digital ($\sigma$ variables). The two sets of hidden
units, $z$ and $\tau$, represent two feature sets that are both connected to the layer of visible units $\sigma$. The
layers of hidden and visible units are reciprocally connected, but there are no intra-layer connections, thus
forming a bipartite topology. Right panel: Schematic representation of the equivalent Hopfield neural
network built upon the visible units only, with an internal fully connected structure.

All neurons of RBMs are binary, both the visible and the hidden units. The analog equivalent of
the RBM, the Restricted Diffusion Network, has all analog units and has been described in [20], [S]. Here
we study the case of a "hybrid" Boltzmann Machine (HBM), in which the hidden units are analog and
the visible units are binary (Fig. 1, left). We show that the HBM, when marginalized over the hidden
units, is equivalent to a Hopfield network (Fig. 1, right), where the $N$ visible units are the neurons and
the $P$ hidden units are the learned patterns. Although the Hopfield network can generate probability
distributions in a limited space, it has been widely studied for its associative and retrieval properties.
The exact mapping proven here introduces a new way to simulate Hopfield networks, and allows a novel
interpretation of the spin glass transition, which translates into an optimal criterion for selecting the
relative size of the hidden and visible layers in the HBM.

We use the method of stochastic stability to study the thermodynamics of the system in the case of
analog synapses. This method has been previously described in [T], [1], and offers an alternative approach
to the replica trick for studying Ising-type neural networks, including the Hopfield model and the HBM.
We analyze the model with two non-interacting sets of hidden units in the HBM, which corresponds
to two sets of uncorrelated patterns in the Hopfield network, and study the thermodynamics with the
assumption of replica symmetry. We extend the theory to cope with two sets of interconnected hidden
layers, corresponding to sets of correlated patterns, and we show that their interaction acts as a noise
source for retrieval.

2 Statistical equivalence of HBM and Hopfield networks



We define a "hybrid" Boltzmann Machine (HBM, see Fig. 1, left) as a network in which the activity
of units in the visible layer is discrete, $\sigma_i = \pm 1$, $i \in \{1,\dots,N\}$ (digital layer), and the activity in
the hidden layer is continuous (analog layer). The layers of hidden and visible units are reciprocally
connected, but there are no intra-layer connections, thus forming a bipartite topology. We assume
that the layer of hidden units is further divided into two sets, both described by continuous variables,
$z_\mu, \tau_\nu \in \mathbb{R}$, $\mu \in \{1,\dots,P\}$, $\nu \in \{1,\dots,K\}$. We will consider the case of interacting hidden units (connections
between $z$ and $\tau$) in the next Section. In order to maintain a parsimonious notation, in this section we
consider a single hidden layer, e.g. only the layer defined by the variables $z$.

The synaptic connections between units in the two layers are fixed and symmetric, and are defined
by the synaptic matrix $\xi_i^\mu$. The input to unit $\sigma_i$ in the visible (digital) layer is the sum of the activities
in the hidden (analog) layer weighted by the synaptic matrix, i.e. $\sum_\mu \xi_i^\mu z_\mu$. The input to unit $z_\mu$ in the
hidden (analog) layer is the sum of the activities in the visible (digital) layer, weighted by the synaptic
matrix, i.e. $\sum_i \xi_i^\mu \sigma_i$. In the following, we denote by $z$ the set of all hidden $\{z_\mu\}$ variables, and by $\sigma$ the
set of all visible $\{\sigma_i\}$ variables.

The dynamics of the activity is different in the two layers; in the analog layer it changes continuously
in time, while in the digital layer it changes in discrete steps. The activity in the hidden (analog) layer
follows the stochastic differential equation

$$T\,\frac{dz_\mu}{dt} = -z_\mu(t) + \sum_i \xi_i^\mu \sigma_i + \sqrt{\frac{2T}{\beta}}\,\zeta_\mu(t), \tag{1}$$

where $\zeta$ is a white Gaussian noise with zero mean and covariance $\langle \zeta_\mu(t)\,\zeta_\nu(t') \rangle = \delta_{\mu\nu}\,\delta(t - t')$. The
parameter $T$ quantifies the timescale of the dynamics, and the parameter $\beta$ determines the strength of
the fluctuations. The first term on the right hand side is a leakage term, the second term is the input
signal and the third term is a noise source. Since noise is uncorrelated between different hidden units,
they evolve independently. Eq. (1) describes an Ornstein-Uhlenbeck diffusion process and, for fixed
values of $\sigma$, the equilibrium distribution of $z_\mu$ is a Gaussian distribution centered around the input signal,
with variance $1/\beta$:

$$\Pr(z_\mu|\sigma) = \sqrt{\frac{\beta}{2\pi}}\,\exp\Big[-\frac{\beta}{2}\Big(z_\mu - \sum_i \xi_i^\mu \sigma_i\Big)^2\Big]. \tag{2}$$

In order for this equilibrium distribution to hold, the activity of the digital units $\sigma$ must be constant, while in
fact it depends on time. However, we assume that the timescale of diffusion $T$ is much shorter than the interval
at which the digital units are updated. Therefore, a different equilibrium distribution for $z$, characterized
by different values of $\sigma$, holds between each subsequent update of $\sigma$. Since hidden units are independent,
their joint distribution is the product of individual distributions, i.e. $\Pr(z|\sigma) = \prod_{\mu=1}^{P} \Pr(z_\mu|\sigma)$.
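The equilibrium statement above can be checked numerically. The following is a minimal sketch, assuming the noise amplitude in Eq. (1) is $\sqrt{2T/\beta}$ and using a plain Euler-Maruyama scheme; the function name `simulate_hidden_unit` and all parameter values are illustrative choices, not taken from the paper. For a fixed input signal, the sampled mean should approach the input and the sampled variance should approach $1/\beta$.

```python
import numpy as np

def simulate_hidden_unit(input_signal, beta, T=1.0, dt=0.01, n_steps=200_000, seed=0):
    """Euler-Maruyama integration of T dz/dt = -z + input + sqrt(2T/beta) * white noise.

    The visible units are held fixed, so the input signal is a constant.
    """
    rng = np.random.default_rng(seed)
    # Noise increment per step: sqrt(2/(beta*T)) * sqrt(dt) * standard normal
    noise = rng.standard_normal(n_steps) * np.sqrt(2.0 * dt / (beta * T))
    z = float(input_signal)  # start at the mean to skip the initial transient
    samples = np.empty(n_steps)
    for k in range(n_steps):
        z += (dt / T) * (input_signal - z) + noise[k]
        samples[k] = z
    return samples

samples = simulate_hidden_unit(input_signal=1.5, beta=2.0)
# Mean should be close to the input (1.5), variance close to 1/beta (0.5)
print(samples.mean(), samples.var())
```

The time step `dt = 0.01` (in units of `T`) mirrors the value quoted for the simulations later in this section; a smaller step reduces the discretization bias in the variance.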

The activity in the visible (digital) layer follows a standard Glauber dynamics for Ising-type systems
[2]. At a specified sequence of time intervals (much larger than $T$), the activity of units in the digital
layer is updated randomly according to a probability that depends on their input. While updating the
digital units $\sigma$, the analog variables $z$ are fixed, namely the update of digital units is instantaneous. Given $z$, the
activity of a unit $\sigma_i$ is independent of the other units, and its probability is a logistic function of its input,
i.e.

$$\Pr(\sigma_i|z) = \frac{\exp\big[\beta \sigma_i \sum_\mu \xi_i^\mu z_\mu\big]}{\exp\big[\beta \sum_\mu \xi_i^\mu z_\mu\big] + \exp\big[-\beta \sum_\mu \xi_i^\mu z_\mu\big]}. \tag{3}$$

Note that this distribution is normalized, namely $\Pr(\sigma_i = +1|z) + \Pr(\sigma_i = -1|z) = 1$. Since visible
units are independent, their joint distribution is the product of individual distributions, i.e. $\Pr(\sigma|z) = \prod_{i=1}^{N} \Pr(\sigma_i|z)$.
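As an illustration of the update rule in Eq. (3), the sketch below samples all visible units at once for a given hidden state. It relies on the identity $e^{x}/(e^{x}+e^{-x}) = (1+\tanh x)/2$, which avoids overflow for large inputs. The function name `glauber_update` and the array layout (weights stored as an `(N, P)` array) are assumptions made for this example.

```python
import numpy as np

def glauber_update(xi, z, beta, rng):
    """Sample every visible unit from Pr(sigma_i | z), Eq. (3).

    xi  : (N, P) array of synaptic weights xi_i^mu (layout assumed for this sketch)
    z   : (P,) array of hidden activities, held fixed during the update
    """
    field = xi @ z                                  # input sum_mu xi_i^mu z_mu to each visible unit
    p_up = 0.5 * (1.0 + np.tanh(beta * field))      # Pr(sigma_i = +1 | z), equivalent to Eq. (3)
    return np.where(rng.random(field.shape) < p_up, 1, -1)

rng = np.random.default_rng(0)
xi = np.ones((5, 1))
sigma_up = glauber_update(xi, np.array([3.0]), beta=10.0, rng=rng)     # strong positive input
sigma_down = glauber_update(xi, np.array([-3.0]), beta=10.0, rng=rng)  # strong negative input
print(sigma_up, sigma_down)
```

Because the visible units are conditionally independent given $z$, updating them in parallel is equivalent to updating them one at a time.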
Given the conditional distributions of both layers, Eqs. (2) and (3), we can determine their joint distribution,
$\Pr(\sigma,z)$, and the marginal distributions $\Pr(z)$ and $\Pr(\sigma)$, apart from a normalization factor. We use
Bayes' rule, $\Pr(\sigma,z) = \Pr(z|\sigma)\Pr(\sigma) = \Pr(\sigma|z)\Pr(z)$, and we use the fact that each marginal distribution
depends on the variables of a single layer. The result for the joint distribution is

$$\Pr(\sigma,z) \propto \exp\Big(-\frac{\beta}{2}\sum_\mu z_\mu^2 + \beta \sum_{i,\mu} \sigma_i \xi_i^\mu z_\mu\Big). \tag{4}$$

Integrating over $z$ gives the marginal distribution of the visible units,

$$\Pr(\sigma) \propto \exp\Big[\frac{\beta}{2}\sum_{i,j}\Big(\sum_\mu \xi_i^\mu \xi_j^\mu\Big)\sigma_i \sigma_j\Big]. \tag{5}$$

As explained in more detail in the next section, this probability distribution is equal to the distribution
of a Hopfield network, where the synaptic weights of the Hopfield network are given by the expression
in round brackets. The stored patterns of the Hopfield model correspond to the synaptic weights of the
HBM, described by the $\xi$ variables, and the number of patterns corresponds to the number $P$ of hidden
units.
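The equivalence can be verified directly on a system small enough to enumerate. The sketch below is an illustration under stated assumptions: it takes the joint weight to be $\exp(-\frac{\beta}{2}\sum_\mu z_\mu^2 + \beta\sum_{i,\mu}\sigma_i\xi_i^\mu z_\mu)$, integrates each hidden unit numerically on a grid, and compares the result with the Hopfield form $\exp[\frac{\beta}{2}\sum_{i,j}(\sum_\mu \xi_i^\mu\xi_j^\mu)\sigma_i\sigma_j]$. The function names, grid parameters, and system size are hypothetical choices for this example.

```python
import itertools
import numpy as np

def all_states(n_visible):
    """All 2^N configurations of the visible units, as rows of +/-1."""
    return np.array(list(itertools.product([-1, 1], repeat=n_visible)))

def hopfield_probabilities(xi, beta):
    """Pr(sigma) from the Hopfield form with couplings sum_mu xi_i^mu xi_j^mu."""
    overlaps = all_states(xi.shape[0]) @ xi          # (2^N, P): sum_i xi_i^mu sigma_i
    log_w = 0.5 * beta * np.sum(overlaps**2, axis=1)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def marginalized_hbm_probabilities(xi, beta, z_max=15.0, n_grid=30001):
    """Pr(sigma) obtained by integrating the joint weight over each hidden unit on a grid."""
    overlaps = all_states(xi.shape[0]) @ xi
    z = np.linspace(-z_max, z_max, n_grid)
    dz = z[1] - z[0]
    # Joint weight for one hidden unit: exp(-beta z^2 / 2 + beta z * overlap)
    integrand = np.exp(-0.5 * beta * z**2 + beta * overlaps[..., None] * z)
    per_unit = integrand.sum(axis=-1) * dz           # Riemann sum over z, shape (2^N, P)
    w = per_unit.prod(axis=1)                        # hidden units factorize given sigma
    return w / w.sum()

rng = np.random.default_rng(1)
N, P, beta = 4, 2, 1.5
xi = rng.choice([-1.0, 1.0], size=(N, P)) / np.sqrt(N)
p_hop = hopfield_probabilities(xi, beta)
p_hbm = marginalized_hbm_probabilities(xi, beta)
print(np.max(np.abs(p_hop - p_hbm)))  # should be at the level of numerical round-off
```

The agreement holds configuration by configuration for the marginal over $\sigma$; individual joint samples $(\sigma, z)$ of the HBM have no counterpart in the Hopfield network.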

Therefore, we have shown that the HBM and Hopfield network admit the same probability distribution,
once the hidden variables of the HBM are marginalized, and the HBM and Hopfield network are
statistically equivalent. In other words, a configuration $\sigma$ in the Hopfield network has the same probability
as the same configuration $\sigma$ in the HBM, when averaged over the hidden configurations $z$. Retrieval
in the Hopfield network corresponds to the case in which the HBM learns to reproduce a specific pattern
of neural activation. The maximum number of patterns $P$ that can be retrieved in a Hopfield network
is known [2], and is equal to $0.14\,N$. If the number of patterns exceeds this limit, the network is not
able to retrieve any of them. On the other hand, if the HBM has a very large number $P$ of hidden variables,
this causes over-fitting in learning the observed patterns, and the HBM is not able to reproduce
the statistics of the observed system. The correspondence between Hopfield network and HBM allows
recognizing that the maximum number of hidden variables in the HBM is $0.14\,N$.

We check this prediction by numerical simulations of the HBM. We pick each element of the synaptic
matrix independently from a Bernoulli distribution, $\xi_i^\mu = +1/\sqrt{N}$ or $\xi_i^\mu = -1/\sqrt{N}$ with 50% probability
(the scaling with $N$ is imposed for comparison to the original Hopfield model). We set the number of
neurons in the visible layer to $N = 1000$, the timescale of the dynamics of the hidden units to $T = 1$, and we
use $T$ as a reference time unit. In each simulation, we update the visible units every $1T$ and we run the
simulation for $1000T$, therefore performing one thousand updates of the visible units. We simulate the
dynamics of hidden units by standard numerical methods for stochastic differential equations, using
a time step of $0.01T$. In different simulations we vary the noise amplitude by setting
$\beta = 0.5, 2, 10$, and the number of hidden units $P = 50, 100, 150, 200$. We observe the overlap of the
activity of visible units with each one of the patterns $\mu$ by computing $\sum_i \sigma_i \xi_i^\mu / \sqrt{N}$, such that an overlap
equal to one for some value of $\mu$ implies that the visible units precisely align to that pattern $\mu$. In each
simulation, we initialize the hidden units at random and the visible units exactly aligned to one of the
patterns.
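The protocol described above can be sketched in a few lines. This is a reduced illustration rather than the simulation code used for the figures: it assumes Euler-Maruyama integration with noise amplitude $\sqrt{2T/\beta}$ for the hidden units and a parallel Glauber update of the visible units, and it uses a much smaller network (`N = 200`, `P = 5`, 20 updates) so that it runs in well under a second. The function name `simulate_hbm` and the default parameters are choices made for this example.

```python
import numpy as np

def simulate_hbm(N=200, P=5, beta=10.0, T=1.0, dt=0.01, n_updates=20, seed=0):
    """Alternate hidden-unit diffusion and visible-unit Glauber updates; return overlaps."""
    rng = np.random.default_rng(seed)
    xi = rng.choice([-1.0, 1.0], size=(N, P)) / np.sqrt(N)   # Bernoulli weights scaled by 1/sqrt(N)
    sigma = np.sign(xi[:, 0])                                # visible units aligned to pattern 0
    z = rng.standard_normal(P)                               # hidden units initialized at random
    steps_per_update = int(round(T / dt))                    # visible units are updated every 1T
    overlaps = np.empty((n_updates, P))
    for n in range(n_updates):
        drive = xi.T @ sigma                                 # input to hidden units, fixed between updates
        for _ in range(steps_per_update):                    # Euler-Maruyama steps for the hidden layer
            z += (dt / T) * (drive - z) + np.sqrt(2.0 * dt / (beta * T)) * rng.standard_normal(P)
        field = xi @ z                                       # input to visible units
        p_up = 0.5 * (1.0 + np.tanh(beta * field))           # Pr(sigma_i = +1 | z)
        sigma = np.where(rng.random(N) < p_up, 1.0, -1.0)
        overlaps[n] = sigma @ xi / np.sqrt(N)                # overlap with every pattern
    return overlaps

overlaps = simulate_hbm()
print(overlaps[-1])  # first entry is the overlap with the initially imposed pattern
```

With low noise and a small load `P / N`, the overlap with the imposed pattern is expected to stay close to one while the overlaps with the other patterns stay small. Reproducing the regimes discussed in the text requires `N = 1000`, 1000 updates, and the listed values of `beta` and `P`.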

Fig. 2 shows the results of simulations, namely the dynamics of the overlap of visible units with all patterns
for different values of the parameters $\beta$ and $P$. For high noise, $\beta = 0.5$, no retrieval is possible and all
overlaps are near zero regardless of the number of hidden units $P$. For intermediate noise, $\beta = 2$, retrieval
is possible provided that the number of hidden units is not too large. The prediction of the Hopfield
network is that retrieval is lost at about $P \simeq 0.06N = 60$ [2], accurately matching our findings. For
low noise, $\beta = 10$, retrieval is maintained up to large values of $P$. In the low noise regime, the Hopfield
network can retrieve a number of patterns near its maximum, i.e. $P = 0.14N = 140$, which again matches
well with the results of our simulations.





Figure 2: Dynamics of the overlap of visible units with all patterns for different values of the parameters
$\beta$ and $P$. Simulations run for 1000 units of time, which corresponds to 1000 updates of the visible units.
The thick blue line in each panel shows the overlap of visible units with the pattern imposed by the initial
condition; the other lines show the overlaps with all other patterns. No retrieval is observed (overlap $\simeq 0$) for
high noise, $\beta = 0.5$, while for intermediate noise, $\beta = 2$, and low noise, $\beta = 10$, retrieval is possible (overlap $\simeq 1$)
for a small number of hidden units (patterns) $P$. Results of simulations match the theory of Hopfield
networks.



3 Thermodynamic theory of HBM 

In canonical statistical mechanics, a system is described by the probability distribution of each one of its
possible states. In the HBM, a given state is associated with its probability according to the Boltzmann
distribution. This distribution is expressed by Eq. (4), which we rewrite while reintroducing the variables
$\tau$ dropped in the previous section and by defining the Hamiltonian function

$$H_{hbm}(\sigma,z,\tau;\xi,\eta) = \frac{1}{2}\Big(\sum_\mu z_\mu^2 + \sum_\nu \tau_\nu^2\Big) - \sum_i \sigma_i\Big(\sum_\mu \xi_i^\mu z_\mu + \sum_\nu \eta_i^\nu \tau_\nu\Big). \tag{6}$$

We denote by $\xi$ the set of all $\{\xi_i^\mu\}$ variables, and by $\eta$ the set of all $\{\eta_i^\nu\}$ variables, where $\eta_i^\nu$ is the synaptic
matrix for the connections with the $\tau$ layer. The Boltzmann distribution depends on the parameters
$\beta, \xi, \eta$, and its expression includes the normalization factor $Z$:

$$\Pr(\sigma,z,\tau) = \exp\big[-\beta H_{hbm}(\sigma,z,\tau;\xi,\eta)\big]\, Z(\beta,\xi,\eta)^{-1}. \tag{7}$$
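The partition function $Z$ corresponds to the normalization factor of the Boltzmann distribution, and is
defined as

$$Z(\beta,\xi,\eta) = \sum_\sigma \int \prod_{\mu=1}^{P} dz_\mu \int \prod_{\nu=1}^{K} d\tau_\nu\, \exp\big(-\beta H_{hbm}(\sigma,z,\tau;\xi,\eta)\big). \tag{8}$$

Using this definition of the partition function, it is straightforward to show that the Boltzmann distribution,
Eq. (7), is normalized. In order to marginalize the hidden variables, we use the following identity, the
Gaussian integral, valid for any real $x$:

$$\int_{-\infty}^{+\infty} dz\, \exp\Big(-\frac{\beta}{2}z^2 + \beta x z\Big) = \sqrt{\frac{2\pi}{\beta}}\,\exp\Big(\frac{\beta}{2}x^2\Big). \tag{9}$$

Using this identity, we marginalize the analog variables $z$ and $\tau$ in Eq. (8), and we obtain

$$Z(\beta,\xi,\eta) = \Big(\frac{2\pi}{\beta}\Big)^{\frac{P+K}{2}} \sum_\sigma \exp\big(-\beta H_{hop}(\sigma;\xi,\eta)\big), \tag{10}$$

where we define the following Hamiltonian:

$$H_{hop}(\sigma;\xi,\eta) = -\frac{1}{2}\sum_{i,j}\Big(\sum_\mu \xi_i^\mu \xi_j^\mu + \sum_\nu \eta_i^\nu \eta_j^\nu\Big)\sigma_i \sigma_j. \tag{11}$$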

This is the Hamiltonian of a Hopfield neural network. This result connects the two Hamiltonians of
the Hopfield network and the Boltzmann Machine and states that the thermodynamics obtained from the
first cost function, Eq. (6), is the same as that obtained from the second one, Eq. (11). This offers
a connection between retrieval through free energy minimization in the Hopfield network and learning
through log-likelihood estimation in the HBM [2] [S]. Note that observable quantities stemming from the HBM
are equivalent in distribution, and not pointwise, to the corresponding ones in the Hopfield network.
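As a worked step, the marginalization of a single hidden unit can be written out explicitly. This is a sketch that assumes the HBM Hamiltonian has the form $\frac{1}{2}\sum_\mu z_\mu^2 - \sum_{i,\mu}\sigma_i \xi_i^\mu z_\mu$ for the $z$ layer; the shorthand $m_\mu(\sigma)$ is introduced only for this illustration. The same computation applies to the $\tau$ layer with $\eta$ in place of $\xi$.

```latex
\begin{align*}
\int_{-\infty}^{+\infty} dz_\mu \,
  \exp\Big(-\frac{\beta}{2} z_\mu^2 + \beta z_\mu\, m_\mu(\sigma)\Big)
  &= \sqrt{\frac{2\pi}{\beta}}\,\exp\Big(\frac{\beta}{2}\, m_\mu(\sigma)^2\Big),
  \qquad m_\mu(\sigma) = \sum_{i} \xi_i^\mu \sigma_i, \\
\frac{\beta}{2}\sum_{\mu} m_\mu(\sigma)^2
  &= \frac{\beta}{2}\sum_{i,j}\Big(\sum_{\mu}\xi_i^\mu \xi_j^\mu\Big)\sigma_i\sigma_j .
\end{align*}
```

Each of the $P + K$ hidden units contributes one prefactor $\sqrt{2\pi/\beta}$, and the exponents add up to $-\beta$ times a Hamiltonian that is quadratic in $\sigma$, with couplings given by the sum over patterns.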

Next, we calculate the free energy, which allows determining the value of all relevant quantities and
the different phases of the system. The thermodynamic approach consists in averaging all observable
quantities over both the noise and the configurations of the system. Therefore, we define two different
types of averaging, the average $\omega$ over the state configurations $\sigma, z, \tau$, and the average $\mathbb{E}$ over the synaptic
weights (quenched noise) $\xi, \eta$. Note that a given HBM is defined by a fixed and constant value of the
synaptic weights $\xi, \eta$. However, those synaptic weights are taken at random from a given distribution,
and different realizations of the synaptic weights correspond to different HBMs. Since we are interested
in determining the average behavior of a "typical" HBM, we average the relevant quantities over the
distribution of synaptic weights.

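The average $\omega$ of a given observable $O$ under the Boltzmann distribution is defined as

$$\omega(O) = Z(\beta,\xi,\eta)^{-1} \sum_\sigma \int \prod_{\mu=1}^{P} dz_\mu \int \prod_{\nu=1}^{K} d\tau_\nu\; O(\sigma,z,\tau)\, \exp\big(-\beta H_{hbm}(\sigma,z,\tau;\xi,\eta)\big). \tag{12}$$

The average $\mathbb{E}$ of a given observable $F$ over the distribution of synaptic weights is defined as

$$\mathbb{E}[F(\xi,\eta)] = \int \prod_{i,\mu} d\mu(\xi_i^\mu) \int \prod_{i,\nu} d\mu(\eta_i^\nu)\; F(\xi,\eta), \tag{13}$$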

where $\mu$ is the standard Gaussian measure, $d\mu(\xi) = \exp(-\xi^2/2)\,d\xi/\sqrt{2\pi}$. Note that the standard Hopfield
network is built with random binary patterns $\xi$, while we use Gaussian patterns here. Although retrieval
with the former choice has been extensively studied, we have chosen the latter in order to show a novel
technique, stochastic stability, for studying the related thermodynamics. For finite $N$ this is known to
be equivalent to the former case, although for infinitely many neurons a complete picture of the quality of the
retrieval is still under discussion.
We define the free energy as

$$A(\beta) = \lim_{N \to \infty} \frac{1}{N}\,\mathbb{E}\big[\log Z(\beta;\xi,\eta)\big]. \tag{14}$$

Since the free energy is proportional to the logarithm of the partition function, and due to the additive
properties of the logarithm, $\log(A \cdot B) = \log A + \log B$, we neglect the factor $(2\pi/\beta)^{(P+K)/2}$ in $Z(\beta;\xi,\eta)$
(see Eq. (10)), as it gives a negligible contribution to the free energy in the thermodynamic limit. We
also neglect this factor in the Boltzmann average $\omega$, as it appears both in the numerator and in the denominator
and therefore cancels out. The thermodynamics can then be described by the standard Gaussian measure.

In the HBM, the parameters $P$ and $K$ determine the number of neurons in the hidden layers, while in
the Hopfield model they represent the number of patterns stored in the network, or the number of stable
states that can be retrieved. We consider the "high storage" regime, in which the number of stored
patterns increases linearly with the number of neurons [2]. In the HBM, this corresponds to the case
in which the sizes of the hidden and visible layers are comparable. Their relative size is quantified by
defining two control parameters $\alpha, \gamma \in \mathbb{R}^+$ as

$$\alpha = \lim_{N \to \infty} \frac{P}{N}, \qquad \gamma = \lim_{N \to \infty} \frac{K}{N}. \tag{15}$$

We further introduce the order parameters $q, p, r$, called overlaps, as

$$q_{ab} = \frac{1}{N}\sum_{i=1}^{N} \sigma_i^a \sigma_i^b, \qquad p_{ab} = \frac{1}{P}\sum_{\mu=1}^{P} z_\mu^a z_\mu^b, \qquad r_{ab} = \frac{1}{K}\sum_{\nu=1}^{K} \tau_\nu^a \tau_\nu^b. \tag{16}$$

These objects describe the correlations between two different realizations of the system (two different
replicas $a, b$). We also define the averages of these overlaps with respect to both state configurations and
synaptic weights (quenched noise). Since the overlaps involve two realizations of the system $(\sigma^a, \sigma^b)$,
the Boltzmann average is performed over both configurations. With some abuse of notation, we use the
symbol $\omega$ to also represent the Boltzmann average over two-system configurations. Therefore, the average
overlaps are defined as

$$\bar{q} = \mathbb{E}\,\omega(q_{ab}), \qquad \bar{p} = \mathbb{E}\,\omega(p_{ab}), \qquad \bar{r} = \mathbb{E}\,\omega(r_{ab}). \tag{17}$$

The goal of the next section is to find an expression for the free energy in terms of these order parameters.
While all configurations of the system are possible, only a subset of them has a significant probability of
occurring. In canonical thermodynamics, those states are described by the minima of the free energy with
respect to the order parameters. The free energy is the difference between the energy and the entropy, and
its minimization corresponds to energy minimization and entropy maximization.
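To make the overlaps concrete, the sketch below computes them for two replicas, assuming the usual definitions $q_{ab} = \frac{1}{N}\sum_i \sigma_i^a\sigma_i^b$, $p_{ab} = \frac{1}{P}\sum_\mu z_\mu^a z_\mu^b$ and $r_{ab} = \frac{1}{K}\sum_\nu \tau_\nu^a\tau_\nu^b$. The function name `replica_overlaps` and the toy configurations are hypothetical; in practice the two replicas would be independent samples of the same network, drawn with the same synaptic weights.

```python
import numpy as np

def replica_overlaps(sigma_a, sigma_b, z_a, z_b, tau_a, tau_b):
    """Overlaps between two replicas (a, b) of the visible and hidden layers."""
    q_ab = np.mean(sigma_a * sigma_b)   # visible layer
    p_ab = np.mean(z_a * z_b)           # first hidden layer
    r_ab = np.mean(tau_a * tau_b)       # second hidden layer
    return q_ab, p_ab, r_ab

sigma_a = np.array([1, -1, 1, 1])
sigma_b = np.array([1, -1, -1, 1])      # differs from replica a in one unit
z_a, z_b = np.array([0.5, -2.0]), np.array([1.0, 1.0])
tau_a, tau_b = np.array([2.0]), np.array([3.0])
q_ab, p_ab, r_ab = replica_overlaps(sigma_a, sigma_b, z_a, z_b, tau_a, tau_b)
print(q_ab, p_ab, r_ab)
```

The averaged order parameters are then obtained by averaging these quantities over many pairs of samples and over realizations of the synaptic weights.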



3.1 Multiple-layer stochastic stability 

By definition of the HBM, we assume that no external field acts on the network; inputs to all neurons are
generated internally by other neurons. The overall stimulus felt by an element of a given layer is the sum,
synaptically weighted, of the activity of neurons in the other layers. Note that neurons are connected
in loops, because a neuron receiving input from a layer also projects back to the same layer. Therefore, the
HBM is a recurrent network, and this makes the calculation of the free energy complicated. However,
the free energy can be calculated in specific cases by using a novel technique that has been developed in
[1], which extends the stochastic stability developed for the analysis of spin glasses [IL]. This technique
introduces an external field acting on the system which "imitates" the internal, recurrently generated
input, by reproducing its average statistics. While the external, fictitious input does not reproduce the
statistics of order two and higher, it correctly reproduces the averages. These external inputs are denoted
as $\eta$ (one for each neuron in each layer, carrying a single index) and are distributed following the standard
Gaussian distribution $\mathcal{N}[0,1]$.

In order to recover the second order statistics, the free energy is interpolated smoothly between the
case in which all inputs are external, and all high order statistics are missing, and the case in which all
inputs are internal, describing the original HBM. We use the interpolating parameter $t \in [0, 1]$, such that
for $t = 0$ the inputs are all external and the calculation is straightforward, while for $t = 1$ the full HBM is
recovered.

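Therefore, we define the interpolating free energy as

$$
\begin{aligned}
A(\beta,t) = \frac{1}{N}\,\mathbb{E}\log \sum_\sigma \int \prod_{\mu=1}^{P} d\mu(z_\mu) \int \prod_{\nu=1}^{K} d\mu(\tau_\nu)\;
&\exp\Big[\sqrt{t}\,\sqrt{\frac{\beta}{N}}\Big(\sum_{i,\mu} \xi_i^\mu \sigma_i z_\mu + \sum_{i,\nu} \eta_i^\nu \sigma_i \tau_\nu\Big)\Big] \\
&\times \exp\Big[\sqrt{1-t}\,\Big(a\sum_{i=1}^{N} \eta_i \sigma_i + b\sum_{\mu=1}^{P} \eta_\mu z_\mu + c\sum_{\nu=1}^{K} \eta_\nu \tau_\nu\Big)\Big] \\
&\times \exp\Big[\frac{1-t}{2}\Big(h\sum_{\mu=1}^{P} z_\mu^2 + e\sum_{\nu=1}^{K} \tau_\nu^2\Big)\Big],
\end{aligned}
\tag{18}
$$

where the hidden variables have been rescaled by $\sqrt{\beta}$ and are integrated against the standard Gaussian
measure, and the synaptic weights, of order one, carry an explicit factor $1/\sqrt{N}$.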

In addition to the fictitious fields $\eta$, we have introduced the auxiliary parameters $a, b, c$, which serve to
weight the external fields. We also introduced an additional leakage (second order) term, parameterized
by $h$ and $e$. Those parameters are chosen once and for all in Appendix 1 and 2 in order to separate the
contributions of the mean and of the fluctuations of the order parameters in the final expression of the free energy.
This technique is called multiple layer stochastic stability because each of the three layers is perturbed
by external fictitious inputs to simplify the expression of the free energy.

The free energy at $t = 0$ is characterized by one-body terms and is calculated in Appendix 1. The
result is



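$$
\begin{aligned}
A(\beta,t=0) ={}& \log 2 + \int d\mu(\eta)\, \log\cosh\Big(\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})}\;\eta\Big) - \frac{\alpha+\gamma}{2}\log\big(1 - \beta(1-\bar{q})\big) \\
&+ \frac{\beta(\alpha+\gamma)}{2}\,\frac{\bar{q}}{1 - \beta(1-\bar{q})} - \frac{\beta(\alpha+\gamma)}{2}.
\end{aligned}
\tag{19}
$$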

In order to derive the expression of the free energy for the HBM, namely for $t = 1$, we use the sum rule

$$A(\beta,t=1) = A(\beta,t=0) + \int_0^1 dt'\, \left.\frac{dA(\beta,t)}{dt}\right|_{t=t'}. \tag{20}$$

Therefore, we need to compute the derivative of the interpolating free energy in order to recover the free
energy of the HBM ($t = 1$). We calculate the derivative in Appendix 2, and the result is

$$\frac{dA}{dt} = S(\alpha,\beta,\gamma) + \frac{\beta}{2}(\bar{q} - 1)(\alpha\bar{p} + \gamma\bar{r}), \tag{21}$$

where the function $S$ is the source of the fluctuations of the order parameters, and is equal to

$$S(\alpha,\beta,\gamma) = -\frac{\beta}{2}\Big\langle (q_{12} - \bar{q})\big[\alpha(p_{12} - \bar{p}) + \gamma(r_{12} - \bar{r})\big]\Big\rangle. \tag{22}$$



In the following, we neglect the contribution of fluctuations, therefore we set $S = 0$. The integral in
Eq. (20) is calculated by substituting the derivative in Eq. (21) with $S = 0$ and, since the latter does not
depend on $t$, the integration over $t \in [0,1]$ simply multiplies it by one. Further, we substitute the expression of the
free energy at $t = 0$, Eq. (19), and we obtain the final expression for the free energy of the HBM ($t = 1$).
The resulting expression is called $A^{RS}$, since it does not include fluctuations of the overlaps, and this
corresponds to the replica symmetric (RS) solution in statistical mechanics.



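$$
\begin{aligned}
A^{RS} ={}& \log 2 + \int d\mu(\eta)\, \log\cosh\Big(\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})}\;\eta\Big) - \frac{\alpha+\gamma}{2}\log\big(1 - \beta(1-\bar{q})\big) \\
&+ \frac{\beta(\alpha+\gamma)}{2}\,\frac{\bar{q}}{1 - \beta(1-\bar{q})} + \frac{\beta}{2}(\bar{q} - 1)(\alpha\bar{p} + \gamma\bar{r}) - \frac{\beta(\alpha+\gamma)}{2}.
\end{aligned}
\tag{23}
$$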
In Appendix 3, we derive the free energy in the case in which an additional external input is applied to
the HBM, in order to force the retrieval of stored patterns. In the next section, we minimize the free
energy with respect to the order parameters, in order to study the phases of the system.



3.2 Free energy minimization and phase transition 

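We minimize the free energy (23) with respect to the order parameters $\bar{q}, \bar{p}, \bar{r}$. This is accomplished by
imposing the following equations

$$\partial_{\bar{q}} A^{RS} = 0, \qquad \partial_{\bar{p}} A^{RS} = 0, \qquad \partial_{\bar{r}} A^{RS} = 0. \tag{24}$$

This gives the following system of integro-differential equations to be simultaneously satisfied

$$\partial_{\bar{p}} A^{RS} = \frac{\alpha\beta}{2}\Big(\bar{q} - \int d\mu(\eta)\, \tanh^2\big(\eta\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})}\big)\Big) = 0, \tag{25}$$

$$\partial_{\bar{r}} A^{RS} = \frac{\gamma\beta}{2}\Big(\bar{q} - \int d\mu(\eta)\, \tanh^2\big(\eta\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})}\big)\Big) = 0. \tag{26}$$

Note that, since the two hidden layers act symmetrically on the visible layer, in the sense that the synaptic
weights are distributed identically, one of the above equations is redundant and the minimization condition
is summarized by the following two equations

$$\bar{p} = \bar{r} = \frac{\beta\bar{q}}{\big(1 - \beta(1-\bar{q})\big)^2}, \tag{27}$$

$$\bar{q} = \int d\mu(\eta)\, \tanh^2\Big(\eta\,\frac{\beta\sqrt{(\alpha+\gamma)\bar{q}}}{1 - \beta(1-\bar{q})}\Big). \tag{28}$$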

These equations describe a minimum of the free energy, as can be checked by calculating the second-order
derivatives of the free energy and verifying that the Hessian has a positive determinant. The minima of the
free energy in the case of imposed retrieval are discussed in Appendix 3.

Next, we study the phase transitions of the system by looking at divergences of the rescaled order
parameters. If the overlap $\bar{q}$ is zero, then all neurons in the visible layer are uncorrelated, implying that
all neurons have random activity and the system has no structure. The value of the parameters for which
the transition to $\bar{q} = 0$ occurs corresponds to the case in which the fluctuations of $\sqrt{N} q$ diverge, and this
defines the critical region. To evaluate the critical region, we study for which values of the parameters
$\alpha, \beta, \gamma$ the squared order parameter $N q^2$ diverges. This is obtained by expanding the hyperbolic tangent
in Eq. (28) to the second order, which gives a meromorphic expression for the overlap. This expression
diverges at the critical region, which is characterized by

$$\beta = \frac{1}{1 + \sqrt{\alpha + \gamma}}. \tag{29}$$

The above equations are consistent with and generalize the results obtained in [2]. Since the hidden
layers are not connected, and $z, \tau$ are conditionally independent, they are equivalent to a single hidden
layer of $P + K$ neurons. Therefore, the equivalent Hopfield network stores $P + K$ independent patterns.
The case of interacting (correlated) patterns is studied in the next section.
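The self-consistency condition and the critical line can be explored numerically. The sketch below is an illustration under explicit assumptions: it takes the replica-symmetric equations in the form $\bar{q} = \int d\mu(\eta)\tanh^2\big(\eta\,\beta\sqrt{(\alpha+\gamma)\bar{q}}/(1-\beta(1-\bar{q}))\big)$ and $\bar{p} = \beta\bar{q}/(1-\beta(1-\bar{q}))^2$, and the critical line as $\beta = 1/(1+\sqrt{\alpha+\gamma})$, which follows from linearizing the first equation around $\bar{q} = 0$. The Gaussian average is evaluated by Gauss-Hermite quadrature, and $\bar{q}$ is found by plain fixed-point iteration, used here only for $\beta < 1$, where the denominator stays positive. Function names and parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes and weights for the weight function exp(-x^2 / 2),
# normalized so that the weights represent the standard Gaussian measure.
NODES, WEIGHTS = hermegauss(60)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)

def critical_beta(alpha, gamma):
    """Critical inverse noise level assumed in this sketch: 1 / (1 + sqrt(alpha + gamma))."""
    return 1.0 / (1.0 + np.sqrt(alpha + gamma))

def solve_q(beta, alpha, gamma, q0=0.5, n_iter=500):
    """Fixed-point iteration for the overlap q (intended for beta < 1)."""
    q = q0
    for _ in range(n_iter):
        slope = beta * np.sqrt((alpha + gamma) * q) / (1.0 - beta * (1.0 - q))
        q = float(np.sum(WEIGHTS * np.tanh(slope * NODES) ** 2))
    return q

def p_from_q(beta, q):
    """Hidden-layer overlap implied by q under the assumed stationarity condition."""
    return beta * q / (1.0 - beta * (1.0 - q)) ** 2

alpha, gamma = 0.05, 0.05
beta_c = critical_beta(alpha, gamma)
q_high_noise = solve_q(0.5, alpha, gamma)   # beta below the critical value
q_low_noise = solve_q(0.9, alpha, gamma)    # beta above the critical value
print(beta_c, q_high_noise, q_low_noise, p_from_q(0.9, q_low_noise))
```

Below the critical value of $\beta$ the iteration contracts to $\bar{q} = 0$, while above it a non-zero overlap appears. The total load enters only through $\alpha + \gamma$, reflecting the fact that two disconnected hidden layers act as a single layer of $P + K$ units.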

3.3 Analysis of interacting hidden layers 

In this section we study the case in which the two hidden layers are connected by mild interactions. When
the hidden units in the two separate layers interact, the performance of the network may change. We
study this case for small interaction strengths, within a mean field approximation, in order to be able to
obtain approximate results. We show that the two hidden layers act reciprocally as an additional noise
source affecting the retrieval of stored patterns in the visible layer, i.e. the retrieval of the activities $\sigma$.

We introduce the interacting energy of the HBM, denoted by $H_I$, where $I$ stands for "interacting
layers":

i?/(cr, z, r; ^, r/) = -j= ^ efcr.z^ + -= ^ (TiTfe + ^ ^f,kZf,Tk, (30) 

if2 ik 

where the last term accounts for the interaction between hidden layers, and its strength is controlled by
the parameter $\epsilon$, which is assumed to be small.

The rigorous analysis of this model is complicated and still under investigation [S]. However, for small
$\epsilon$, exact bounds can be obtained by first-order expansion. We proceed as follows: first we marginalize
over one layer (either $\tau$ or $z$) and we find an expression of the interacting partition function depending on
the two remaining ones. Then, because of the symmetry between the hidden layers, we perform the same
operation marginalizing the interacting partition function with respect to the other hidden layer. Last,
we sum the two expressions and divide the result by two. This should represent the average behavior of
the neural network, whose properties are then discussed.

The interacting partition function $Z_I$, associated with the energy (30), can be written as



We start integrating over the r variables, and we find 

n N OiN „ P iN jN 

(7 ij U /J,— 1 /J, /i/i' 

where the effective inputs $ and ^ are given by 



(31) 



(32) 



7=Eer-^+4E-^(E^rc)> (33) 



= ^(ECC')- (34) 

V 

Next, we use the mean field approximation by which ^'^^'Z^' ^ —(a[Pl2)z^. Therefore, we can 
bound the expression above with the partition function 

„ N aN jN 

If we perform the same procedure, integrating first on z and then on r, we obtain the specular result 



(36) 



To obtain the final equation for the partition function, we sum the two Hamiltonians and divide by two, 
to find 
Retaining only the first order terms in $\epsilon$, we obtain an equivalent Hamiltonian for a HBM where the hidden
layers interact. This corresponds to a Hopfield model with an additional noise source, characterized by
the Hamiltonian

„ N aN jN 

H{a- = ^ E ( E erejfl - + E - ^/3'«/4]) . (37) 

Note that for $\epsilon = 0$ we recover the standard Hopfield model. The effect of the additional noise source on
the retrieval of patterns corresponding to one layer depends on the load of the other layer: the larger the number
of neurons in one layer, the larger the perturbation on the retrieval of the other layer.

4 Conclusions 

We demonstrate an exact mapping between the Hopfield network and a specific type of Boltzmann
Machine, the Hybrid Boltzmann Machine (HBM), in which the hidden layer is analog and the visible
layer is digital. This type of structure is novel, since previous studies have investigated the cases in which
both types of layers are either analog or digital. The thermodynamic equivalence demonstrated in our
study paves the way to a novel procedure for simulating large Hopfield networks. In particular, Hopfield
networks require updating $N$ neurons and storing $N(N-1)/2$ synapses, while HBMs require updating
$N + P$ neurons and storing only $NP$ synapses, where $P$ is the number of stored patterns.
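The storage comparison is simple arithmetic, sketched below for the network size used in the simulations ($N = 1000$) at a load close to the maximum capacity ($P = 140$). The function name `synapse_counts` is a hypothetical helper introduced for this example.

```python
def synapse_counts(n_visible, n_patterns):
    """Number of stored synapses: fully connected Hopfield network vs. bipartite HBM."""
    hopfield = n_visible * (n_visible - 1) // 2   # one synapse per pair of neurons
    hbm = n_visible * n_patterns                  # one synapse per visible-hidden pair
    return hopfield, hbm

hopfield, hbm = synapse_counts(1000, 140)
print(hopfield, hbm)  # 499500 140000
```

The HBM stores fewer synapses whenever $P < (N-1)/2$, which always holds in the retrieval regime, where $P$ does not exceed about $0.14\,N$.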

In addition, the well known phase transition of the Hopfield model has a counterpart in the HBM. In 
Boltzmann Machines, the ratio between the sizes of the hidden and visible layers is arbitrary and needs 
to be adjusted in order to obtain the optimal generative model of the observed data. If the number 
of hidden units is too small, the generative model is over-constrained and is not able to learn, while 
if it is too big then the model "overlearns" (overfits) the observed data and is not able to generalize 
[5]. Interestingly, these two extrema correspond in the Hopfield model to, respectively, the low storage 
phase, in which only a few patterns can be represented, and the spin glass phase, in which there is 
an exponentially increasing number of stable states. Therefore, the corresponding phase transition in 
the HBM can be understood as the optimal trade-off between flexibility and generality, thus effectively 
representing a statistical regularization procedure [5]. 

Furthermore, we showed that, if hidden layers are disconnected, the corresponding patterns contribute
linearly to the capacity of the Hopfield network. Therefore, conditional independence among layers
corresponds to linearity of the energy function. If instead the hidden layers interact, we showed that
they affect retrieval by acting as an effective noise source. Although the replica trick has represented
a breakthrough for studying the thermodynamics of the Hopfield model, we argue that the "natural"
mathematical backbone required for studying the thermodynamics of the Boltzmann machine is
stochastic stability, whose implementation is tractable.

Our work further contributes to connecting scientific communities that are quite far apart, such as the
mathematical physicists studying spin glasses (see e.g. |TD]) and the computer scientists studying machine
learning and artificial intelligence (see e.g. [21]).
Appendix 1

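In this appendix, we calculate the interpolating free energy $A(\beta, t)$ for $t = 0$. This calculation involves
only one-body terms and is equal to

$$A(\beta, t=0) = \frac{1}{N}\,\mathbb{E}\log \sum_{\sigma} \int \prod_{\mu=1}^{P} d\mu(z_\mu) \int \prod_{\nu=1}^{K} d\mu(\tau_\nu)\, \exp\Big(a\sum_{i=1}^{N}\eta_i\sigma_i + b\sum_{\mu=1}^{P}\eta_\mu z_\mu + c\sum_{\nu=1}^{K}\eta_\nu \tau_\nu\Big) \exp\Big(\frac{d}{2}\sum_{\mu=1}^{P} z_\mu^2 + \frac{e}{2}\sum_{\nu=1}^{K}\tau_\nu^2\Big).$$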

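Due to the additive property of the logarithm (i.e. $\log(A \cdot B \cdot C) = \log A + \log B + \log C$) and the
one-body factorization within each layer, the equation above can be rewritten as a sum of three separate
terms, one for each layer:

$$A(\beta, t=0) = \frac{1}{N}\,\mathbb{E}\log \sum_{\sigma} \exp\Big(a\sum_{i=1}^{N}\eta_i\sigma_i\Big) \tag{38}$$

$$+ \frac{1}{N}\,\mathbb{E}\log \int \prod_{\mu=1}^{P} d\mu(z_\mu)\, \exp\Big(b\sum_{\mu=1}^{P}\eta_\mu z_\mu\Big) \exp\Big(\frac{d}{2}\sum_{\mu=1}^{P} z_\mu^2\Big) \tag{39}$$

$$+ \frac{1}{N}\,\mathbb{E}\log \int \prod_{\nu=1}^{K} d\mu(\tau_\nu)\, \exp\Big(c\sum_{\nu=1}^{K}\eta_\nu \tau_\nu\Big) \exp\Big(\frac{e}{2}\sum_{\nu=1}^{K}\tau_\nu^2\Big). \tag{40}$$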

Appendix 2 shows that the following choice of the parameters substantially simplifies the calculations:
$a = \sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})}$, $b = \sqrt{\beta\bar{q}}$, $c = \sqrt{\beta\bar{q}}$, $d = e = \beta(1 - \bar{q})$. Using these values of
the parameters, and performing the integrals and sums in the above expression, we find

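$$A(\beta, t=0) = \log 2 + \int d\mu(\eta)\, \log\cosh\Big(\eta\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})}\Big) + \frac{\alpha+\gamma}{2}\log\frac{1}{1-\beta(1-\bar{q})} + \frac{\alpha+\gamma}{2}\,\frac{\beta\bar{q}}{1-\beta(1-\bar{q})}. \tag{41}$$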
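
The sums and integrals at t = 0 reduce to two elementary one-body computations. As a sketch, assuming that $d\mu$ denotes the standard Gaussian measure, that the quadratic term of a hidden unit enters the exponent with coefficient $d/2 < 1/2$, and that the random fields $\eta$ are standard Gaussians:

```latex
% Binary visible unit: sum over sigma = +1, -1
\sum_{\sigma=\pm 1} e^{a \eta \sigma} = 2\cosh(a\eta)
\quad\Longrightarrow\quad
\mathbb{E}\log \sum_{\sigma=\pm 1} e^{a\eta\sigma} = \log 2 + \int d\mu(\eta)\,\log\cosh(a\eta).

% Analog hidden unit: complete the square in the Gaussian integral
\int d\mu(z)\, e^{b\eta z + \frac{d}{2} z^2}
  = \frac{1}{\sqrt{1-d}}\, \exp\!\Big(\frac{b^2\eta^2}{2(1-d)}\Big)
\quad\Longrightarrow\quad
\mathbb{E}\log \int d\mu(z)\, e^{b\eta z + \frac{d}{2} z^2}
  = \frac{1}{2}\log\frac{1}{1-d} + \frac{b^2}{2(1-d)}.
```

With $P = \alpha N$ units in one hidden layer and $K = \gamma N$ in the other, each hidden unit contributes the second expression once, so the hidden layers together carry a prefactor $(\alpha+\gamma)/2$ when $b = c$ and $d = e$.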

Appendix 2 

In this section we focus on the $t$-derivative of $A(\beta, t)$. Since the interpolating parameter $t$ appears seven
times in the exponential, this derivative includes seven different terms. Their derivation is long but
straightforward; here we report the result for each of the seven terms:

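$$-\frac{\alpha\beta}{2}\langle q_{12}\,p_{12}\rangle + \frac{\beta}{2N}\sum_{\mu}\langle z_\mu^2\rangle, \tag{42}$$

$$-\frac{\gamma\beta}{2}\langle q_{12}\,r_{12}\rangle + \frac{\beta}{2N}\sum_{\nu}\langle \tau_\nu^2\rangle, \tag{43}$$

$$-\frac{a^2}{2}\big(1 - \langle q_{12}\rangle\big), \tag{44}$$

$$\frac{\alpha b^2}{2}\langle p_{12}\rangle - \frac{b^2}{2N}\sum_{\mu}\langle z_\mu^2\rangle, \tag{45}$$

$$\frac{\gamma c^2}{2}\langle r_{12}\rangle - \frac{c^2}{2N}\sum_{\nu}\langle \tau_\nu^2\rangle, \tag{46}$$

$$-\frac{d}{2N}\sum_{\mu}\langle z_\mu^2\rangle, \tag{47}$$

$$-\frac{e}{2N}\sum_{\nu}\langle \tau_\nu^2\rangle. \tag{48}$$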
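Collecting the various terms together we obtain

$$\frac{dA}{dt} = \sum_{\mu}\langle z_\mu^2\rangle\Big(\frac{\beta}{2N} - \frac{b^2}{2N} - \frac{d}{2N}\Big) + \sum_{\nu}\langle \tau_\nu^2\rangle\Big(\frac{\beta}{2N} - \frac{c^2}{2N} - \frac{e}{2N}\Big) - \frac{\alpha\beta}{2}\langle q_{12}\,p_{12}\rangle - \frac{\gamma\beta}{2}\langle q_{12}\,r_{12}\rangle - \frac{a^2}{2}\big(1 - \langle q_{12}\rangle\big) + \frac{\alpha b^2}{2}\langle p_{12}\rangle + \frac{\gamma c^2}{2}\langle r_{12}\rangle. \tag{49}$$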

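We are left with the freedom of choosing the most convenient parameters; we see that with the particular
choice

$$a = \sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})}, \quad b = \sqrt{\beta\bar{q}}, \quad c = \sqrt{\beta\bar{q}}, \quad d = e = \beta(1 - \bar{q}),$$

we can express the whole derivative as the source of the overlap fluctuations

$$\frac{dA}{dt} = -\frac{\beta}{2}\Big\langle (q_{12} - \bar{q})\big[\alpha(p_{12} - \bar{p}) + \gamma(r_{12} - \bar{r})\big]\Big\rangle + \frac{\beta}{2}(\bar{q} - 1)(\alpha\bar{p} + \gamma\bar{r}). \tag{50}$$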

The first term on the right-hand side represents the fluctuations of each order parameter around its
average (i.e. $\bar{q}, \bar{p}, \bar{r}$), and we neglect this term within a replica symmetric approach. The second term
includes only averages and does not depend on $t$, so its integration in $t$ over the interval $[0, 1]$ coincides
with multiplication by one.
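
The way the two appendices combine can be written as a one-line sum rule. This is a sketch under the assumption that $t = 1$ recovers the original model and $t = 0$ the one-body problem of Appendix 1; the symbol $S(t)$ for the fluctuation term is introduced here only for illustration.

```latex
A(\beta, t=1) = A(\beta, t=0) + \int_0^1 dt\, \frac{dA}{dt}
             = A(\beta, t=0) + \int_0^1 dt\, S(t) - \frac{\beta}{2}(1-\bar{q})(\alpha\bar{p} + \gamma\bar{r}),
\qquad
S(t) = -\frac{\beta}{2}\Big\langle (q_{12}-\bar{q})\big[\alpha(p_{12}-\bar{p}) + \gamma(r_{12}-\bar{r})\big]\Big\rangle .
```

Dropping $\int_0^1 dt\, S(t)$ gives the replica symmetric approximation: the one-body free energy at $t = 0$ plus the $t$-independent term, which is unchanged by the integration over $[0, 1]$.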



Appendix 3 

In this section, we calculate the free energy in the presence of an external field designed to force retrieval of the
stored patterns. When the stored patterns are Gaussian, and in the thermodynamic limit, retrieval is not
a spontaneous emergent feature of the network. However, it is possible to force retrieval by adding a proper
Lagrange multiplier in the interpolating free energy as $t\,m_1 + (1-t)\,m_1 M_1$, where $m_1 = \frac{1}{N}\sum_i \xi_i^1 \sigma_i$ is the
Mattis magnetization of the first condensed pattern (we chose the first because there is full permutational
invariance among patterns) and $M_1$ (written $M$ in what follows) is its replica symmetric approximation.

In analogy with the calculation performed in Section 4, we find the following expression 

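$$\frac{1}{N}\,\mathbb{E}\big(\log Z(\beta; \xi, \eta)\big) + \frac{\beta}{2}\int_0^1 dt\, \Big\langle (q_{12} - \bar{q})\big[\alpha(p_{12} - \bar{p}) + \gamma(r_{12} - \bar{r})\big]\Big\rangle = \frac{\beta}{2}\int_0^1 dt\, \big\langle (m_1 - M)^2 \big\rangle + A^{RS}(\bar{p}, \bar{q}, \bar{r}, M; \alpha, \beta, \gamma). \tag{51}$$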







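Fluctuations of $m_1$ around $M$ are now present. The final replica symmetric free energy can be written
as

$$A^{RS}(\bar{p}, \bar{q}, \bar{r}, M; \alpha, \beta, \gamma) = \log 2 + \int d\mu(\eta)\, \log\cosh\Big(\eta\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})} + \beta M\Big) + \frac{\alpha+\gamma}{2}\log\frac{1}{1-\beta(1-\bar{q})} + \frac{(\alpha+\gamma)\beta}{2}\,\frac{\bar{q}}{1-\beta(1-\bar{q})} - \frac{\beta}{2}(\alpha\bar{p} + \gamma\bar{r})(1-\bar{q}) - \frac{\beta}{2}M^2. \tag{52}$$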

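We have to minimize the free energy (52) with respect to the replica symmetric order parameters $\bar{q}, \bar{p}, \bar{r}, M$,
namely we impose that

$$\partial_{\bar{q}} A^{RS}(\beta; \alpha, \gamma) = 0, \quad \partial_{\bar{p}} A^{RS}(\beta; \alpha, \gamma) = 0, \quad \partial_{\bar{r}} A^{RS}(\beta; \alpha, \gamma) = 0, \quad \partial_{M} A^{RS}(\beta; \alpha, \gamma) = 0.$$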
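This gives the following system of integrodifferential equations to be simultaneously satisfied

$$\partial_{\bar{p}} A^{RS} = \frac{\alpha\beta}{2}\Big(\bar{q} - \int d\mu(\eta)\, \tanh^2\Big(\eta\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})} + \beta M\Big)\Big) = 0, \tag{54}$$

$$\partial_{\bar{r}} A^{RS} = \frac{\gamma\beta}{2}\Big(\bar{q} - \int d\mu(\eta)\, \tanh^2\Big(\eta\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})} + \beta M\Big)\Big) = 0, \tag{55}$$

$$\partial_{M} A^{RS} = \beta\Big(\int d\mu(\eta)\, \tanh\Big(\eta\sqrt{\beta(\alpha\bar{p} + \gamma\bar{r})} + \beta M\Big) - M\Big) = 0, \tag{56}$$

which can be solved numerically.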
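
One way to approach such a numerical solution is fixed-point iteration with the Gaussian averages evaluated by quadrature. The sketch below is illustrative only and rests on stated assumptions: the argument of the hyperbolic tangent is taken to be eta * sqrt(v) + beta * M with v = beta * (alpha * p + gamma * r), and v is treated as a given input because the stationarity condition in q that closes the system is not reproduced here. The function names are hypothetical.

```python
import math

def gaussian_average(f, half_width=10.0, steps=2001):
    """Approximate the average of f(eta) over eta ~ N(0, 1) with the trapezoid rule."""
    h = 2.0 * half_width / (steps - 1)
    total = 0.0
    for k in range(steps):
        eta = -half_width + k * h
        weight = 0.5 if k in (0, steps - 1) else 1.0
        total += weight * f(eta) * math.exp(-0.5 * eta * eta)
    return total * h / math.sqrt(2.0 * math.pi)

def solve_magnetization(beta, noise_variance, m_start=1.0, tol=1e-12, max_iter=1000):
    """Iterate M <- <tanh(eta * sqrt(v) + beta * M)> until successive values agree."""
    s = math.sqrt(noise_variance)
    m = m_start
    for _ in range(max_iter):
        m_new = gaussian_average(lambda eta: math.tanh(eta * s + beta * m))
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

def overlap(beta, noise_variance, m):
    """Evaluate q = <tanh^2(eta * sqrt(v) + beta * M)> at fixed v and M."""
    s = math.sqrt(noise_variance)
    return gaussian_average(lambda eta: math.tanh(eta * s + beta * m) ** 2)
```

With no pattern noise (v = 0) the iteration reduces to the mean-field equation M = tanh(beta * M), which gives a quick sanity check. A full solver would add the missing condition and update p, r and hence v inside the same loop.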